Introduction¶
The FIFA series is a collection of football simulation video games developed by EA Sports. Each year, a new installment is released for millions of fans around the world.
The game is well known for its close resemblance to real-life football, adding many features that mimic the professional sport. One of these features is the inclusion of a transfer market value for each player. This value represents how much clubs would have to pay in order to acquire that player.
Objectives¶
The goal of this project is to gather and analyze FIFA player data, then predict the transfer market value tier of each player through machine learning.
Summary¶
In this project we will follow the standard machine learning workflow to tackle this investigation. Below is an outline of each step.
| Cells | Section |
|---|---|
| 1-44 | Data Engineering |
| 45-50 | Exploratory Data Analysis |
| 51-79 | Feature Engineering |
| 80-82 | Data Modeling |
| 83-85 | Model Evaluation and Performance |
Creating and defining table on PostgreSQL database and ingesting the data¶
import psycopg2
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres password=padrepio1410")
cur = conn.cursor()
# cur.execute("""
# DROP TABLE IF EXISTS fifa19;
# CREATE TABLE fifa19 (
# index integer,
# id integer,
# name text,
# age integer,
# photo text,
# nationality text,
# flag text,
# overall integer,
# potential integer,
# club text,
# club_logo text,
# mkt_value text,
# wage text,
# special integer,
# preferred_foot text,
# international_rep integer,
# weak_foot integer,
# skill_moves integer,
# work_rate text,
# body_type text,
# real_face text,
# position text,
# jersey_number integer,
# joined text,
# loaned_from text,
# contract_valid_until text,
# height text,
# weight text,
# ls text,
# st text,
# rs text,
# lw text,
# lf text,
# cf text,
# rf text,
# rw text,
# lam text,
# cam text,
# ram text,
# lm text,
# lcm text,
# cm text,
# rcm text,
# rm text,
# lwb text,
# ldm text,
# cdm text,
# rdm text,
# rwb text,
# lb text,
# lcb text,
# cb text,
# rcb text,
# rb text,
# crossing integer,
# finishing integer,
# headingaccuracy integer,
# shortpassing integer,
# volleys integer,
# dribbling integer,
# curve integer,
# fkaccuracy integer,
# longpassing integer,
# ballcontrol integer,
# acceleration integer,
# sprintspeed integer,
# agility integer,
# reactions integer,
# balance integer,
# shotpower integer,
# jumping integer,
# stamina integer,
# strength integer,
# longshots integer,
# aggression integer,
# interceptions integer,
# positioning integer,
# vision integer,
# penalties integer,
# composure integer,
# marking integer,
# standingtackle integer,
# slidingtackle integer,
# gkdiving integer,
# gkhandling integer,
# gkkicking integer,
# gkpositioning integer,
# gkreflexes integer,
# release_clause text);
# """)
# conn.commit()
# data ingestion
# with open('./fifa19.csv', "r") as f:
#     cur.copy_from(f, 'fifa19', sep=',', null="")
#     conn.commit()
# basic operations
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
Querying data from PostgreSQL database
sql_command = 'SELECT * FROM fifa19'
df = pd.read_sql(sql_command, conn)
conn.close()
df.head(10)
df.shape
df.info()
This dataset contains over 80 fields, some of which do not apply to our analysis.
We'll drop the fields that are less meaningful.
df.drop(columns=['index','id','photo','flag','club_logo','special','real_face','jersey_number','contract_valid_until','release_clause'],inplace=True)
The table has many missing values. We'll have to handle these properly before we can continue modifying the dataset.
# List of fields with missing values and count
df.isnull().sum().sort_values(ascending=False)
We'll deal with the loaned_from column first, as it is almost entirely missing.
A null value in this column simply means the player is not on loan, which explains why the majority of values are null. Dropping this column would make sense; however, I'm interested in seeing whether loans affect player wage, so we'll convert it to a boolean column instead.
df['on_loan'] = (df['loaned_from'].notnull()).astype(int)
df['on_loan'].unique()
# dropping df['loaned_from'] as it is no longer needed
df.drop(columns=['loaned_from'],inplace=True)
Next, we'll deal with the club column and fill missing values with 'No Club'
df['club'].fillna(value='No Club',inplace=True)
There appear to be many players with 'No Club'. I assume these players will have little value and no wage; if so, they should be dropped as they would skew the results.
Let's investigate further
len(df['club'][df['club'] == 'No Club'])
df_noClub = df[(df['wage'] == '€ 0') & (df['club']=='No Club')][['name','club','mkt_value','wage']]
df_noClub.head()
print('Number of players with no wage and club: ',len(df_noClub))
The results align with our hypothesis: all players with no club indeed have a low transfer market value and no wage.
These players only make up a small portion of the dataset, so removing them makes sense.
## Dropping these players after analysis
drop_list = df[(df['wage'] == '€ 0') & (df['club'] == 'No Club')].index
df.drop(index=drop_list,inplace=True)
Next, to deal with the joined column, we'll create a new column named 'months_at_club' which stores the number of months a player has been at their club. For missing values we'll assign a value of 0.
from datetime import datetime
def num_months(df_value):
    try:
        joined_date = datetime.strptime(df_value, "%d-%b-%y")
        delta = datetime.now() - joined_date
        # 30.436875 = average number of days per month
        return divmod(delta.days, 30.436875)[0]
    except (ValueError, TypeError):
        # missing or malformed join dates default to 0 months
        return 0
df['months_at_club'] = df['joined'].apply(num_months)
## Dropping df['joined'] as it is no longer needed
df.drop(columns=['joined'],inplace=True)
df[['name','club','months_at_club']].head()
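Before moving on, it's worth sanity-checking that this conversion maps missing and malformed join dates to 0; the function is redefined below so the snippet stands alone:

```python
from datetime import datetime

# same logic as num_months above, redefined so the check is self-contained
def num_months(df_value):
    try:
        joined_date = datetime.strptime(df_value, "%d-%b-%y")
        delta = datetime.now() - joined_date
        return divmod(delta.days, 30.436875)[0]  # 30.436875 = average days per month
    except (ValueError, TypeError):
        return 0

assert num_months(None) == 0           # missing join date
assert num_months("Free Agent") == 0   # malformed value
assert num_months("01-Jan-10") > 0     # any past date yields a positive month count
```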
Remaining null values listed below
df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending=False)
Looking at the remaining list, the majority of the columns are player attributes (sprintspeed, dribbling, volleys, etc.) and have the same number of missing values. Let's explore this by filtering for missing attribute values only.
## Acceleration was randomly chosen from attributes to analyze
df_null = df[(df['acceleration'].isnull())]
df_null.head()
columns = set(df.columns) #list of columns
non_null_list = df_null.dropna(axis=1, how='all').columns #list of non null columns in above df
non_null_count = len(non_null_list)
total_column_count = len(df.columns)
null_count = total_column_count - non_null_count
remaining = [x for x in columns if x not in non_null_list]
print('Number of records: {}'.format(len(df_null)))
print('List of fields that only contain null entries in the above df:',"\n")
print(remaining,"\n")
print("Number of null fields: {} of {}".format(null_count,total_column_count))
print("Percentage: {:.2%}".format(null_count/total_column_count))
Interestingly, over 85% of the fields in the filtered table contain no values at all.
These players will not help with our analysis as they do not have enough information. In total there are only 48 players, therefore dropping these rows will not have a large effect, so we'll go ahead and remove them from the table.
drop_list=df[df['acceleration'].isnull()].index
df.drop(index=drop_list,inplace=True)
Remaining null values listed below
df.isnull().sum()[df.isnull().sum() > 0].sort_values(ascending=False)
The rest of the null values appear only in the outfield position-rating columns, never for GK. I assume the affected players are all goalkeepers, but we'll check to be sure.
df_gk = df[df['position']=='GK']
df_gk.sample(10).iloc[:,14:43]
len(df_gk)
This confirms our hypothesis. Let's fill the remaining nulls with a value of 0
df.fillna(value=0, inplace=True)
df.isnull().sum().sum()
Our dataset now does not contain any missing values!
We can now proceed with data cleaning and manipulation
Converting appropriate categorical features to numerical features¶
Currency values are listed in Euros and not in standard notation, making them hard to work with.
Let's write some functions to convert these values.
df[['mkt_value','wage']].head()
def convert(df_value):
    try:
        value = float(df_value[1:-1])  # strip the leading '€' and trailing suffix
        suffix = df_value[-1:]
        if suffix == 'M':
            value = value * 1000000
        elif suffix == 'K':
            value = value * 1000
    except (ValueError, TypeError):
        # values such as '€ 0' have no numeric body after stripping; default to 0
        value = 0
    return value
df['mkt_value'] = df['mkt_value'].apply(convert)
df['wage'] = df['wage'].apply(convert)
#result
df[['mkt_value','wage']].head()
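A quick sanity check of the conversion on representative strings (the function is redefined so the snippet runs standalone):

```python
# same conversion logic as above, redefined so the check is self-contained
def convert(df_value):
    try:
        value = float(df_value[1:-1])  # strip leading '€' and trailing suffix
        suffix = df_value[-1:]
        if suffix == 'M':
            value = value * 1000000
        elif suffix == 'K':
            value = value * 1000
    except (ValueError, TypeError):
        value = 0
    return value

assert convert('€110.5M') == 110500000.0
assert convert('€565K') == 565000.0
assert convert('€0') == 0  # no numeric body left after stripping, falls back to 0
```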
Similarly, weight and height are listed in lbs and feet/inches respectively. Let's convert those to numerical values.
df[['weight','height']].head()
def heightConverter(df_value):
    feet, inches = df_value.split("'")
    # 1 foot = 30.48 cm, 1 inch = 2.54 cm
    return (int(feet) * 30.48) + (int(inches) * 2.54)

def weightConverter(df_value):
    # strip the trailing 'lbs' suffix
    return int(df_value[:-3])
df['height(cm)'] = df['height'].apply(heightConverter)
df['weight(lbs)'] = df['weight'].apply(weightConverter)
df.drop(columns=['height','weight'],inplace=True)
df[['height(cm)','weight(lbs)']].head()
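Checking the two converters on a sample entry (redefined here so the snippet is self-contained):

```python
# redefined from above so the check runs standalone
def heightConverter(df_value):
    feet, inches = df_value.split("'")
    return (int(feet) * 30.48) + (int(inches) * 2.54)  # cm per foot / inch

def weightConverter(df_value):
    return int(df_value[:-3])  # strip trailing 'lbs'

assert abs(heightConverter("5'7") - 170.18) < 1e-6  # 5 ft 7 in = 170.18 cm
assert weightConverter("159lbs") == 159
```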
Next, we'll convert the position ratings to numerical values as well and group them into attacking and defensive ratings
df[['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam',
'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm',
'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']].head()
def ratingConverter(df_value):
    df_value = str(df_value)
    if "+" in df_value:
        # ratings such as '88+2' store a base value plus a positional boost
        base, boost = df_value.split("+")
        return int(base) + int(boost)
    return int(df_value)
atr_columns = ['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam',
'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm',
'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']
for col in atr_columns:
df[col] = df[col].apply(ratingConverter)
df[['ls', 'st', 'rs', 'lw', 'lf', 'cf', 'rf', 'rw', 'lam', 'cam',
'ram', 'lm', 'lcm', 'cm', 'rcm', 'rm', 'lwb', 'ldm', 'cdm', 'rdm',
'rwb', 'lb', 'lcb', 'cb', 'rcb', 'rb']].head()
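The rating conversion can be verified on the two formats that appear in the data (redefined so the snippet stands alone):

```python
# redefined from above so the check is self-contained
def ratingConverter(df_value):
    df_value = str(df_value)
    if "+" in df_value:
        base, boost = df_value.split("+")
        return int(base) + int(boost)
    return int(df_value)

assert ratingConverter('88+2') == 90  # base rating plus boost
assert ratingConverter('70') == 70    # plain rating string
assert ratingConverter(0) == 0        # goalkeepers were filled with 0 earlier
```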
df['att_rate'] = (df['rf'] + df['st'] + df['lf'] + df['rs'] + df['ls'] + df['cf']\
+ df['lw'] + df['rcm'] + df['lcm'] + df['cam'] + df['rm']\
+ df['lam'] + df['lm'] + df['rw'] + df['cm'] + df['ram'] ) / 16
df['def_rate'] = (df['rcb'] + df['cb'] + df['lcb'] + df['lb'] + df['rb'] + df['rwb']\
                  + df['lwb'] + df['ldm'] + df['rdm'] + df['cdm']) / 10
df.drop(columns=['rf', 'st', 'lw', 'rcm', 'lf', 'rs', 'rcb', 'lcm', 'cb', 'ldm', 'cam', 'cdm',
'ls', 'lcb', 'rm', 'lam', 'lm', 'lb', 'rdm', 'rw', 'cm', 'rb', 'ram', 'cf', 'rwb', 'lwb'
], inplace=True)
df[['name','att_rate','def_rate']].head(4)
As seen below, some player body types are inconsistent and don't comply with the standard classification: ('Lean', 'Normal', 'Stocky')
df[~df['body_type'].isin(('Lean', 'Normal','Stocky'))][['name','body_type']]
Let's change these values so that body_type is consistent.
df.loc[df['body_type'] == 'Messi', 'body_type'] = 'Lean'
df.loc[df['body_type'] == 'C. Ronaldo', 'body_type'] = 'Normal'
df.loc[df['body_type'] == 'Neymar', 'body_type'] = 'Lean'
df.loc[df['body_type'] == 'Courtois', 'body_type'] = 'Lean'
df.loc[df['body_type'] == 'PLAYER_BODY_TYPE_25', 'body_type'] = 'Normal'  ## PLAYER_BODY_TYPE_25 corresponds to M. Salah
df.loc[df['body_type'] == 'Shaqiri', 'body_type'] = 'Stocky'
df.loc[df['body_type'] == 'Akinfenwa', 'body_type'] = 'Stocky'
To make the position field easier to work with, we'll classify player positions into four categories: F (attacking), M (midfield), D (defensive) and GK (goalkeeper).
df['position'].unique()
def positionConverter(df_value):
    if df_value in ["RF", "ST", "LW", "LF", "RS", "LS", "RW", "CF"]:
        return 'F'
    elif df_value in ["RCM", "LCM", "LDM", "CAM", "CDM", "RM", "LAM", "LM", "RDM", "CM", "RAM"]:
        return 'M'
    elif df_value in ["RCB", "CB", "LCB", "LB", "RB", "RWB", "LWB"]:
        return 'D'
    else:
        # goalkeepers ('GK') fall through unchanged
        return df_value
df['position'] = df['position'].apply(positionConverter)
df[['name','position']].sample(5)
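A quick check that one position from each group maps as intended (redefined so the snippet is self-contained):

```python
# redefined from above so the check runs standalone
def positionConverter(df_value):
    if df_value in ["RF", "ST", "LW", "LF", "RS", "LS", "RW", "CF"]:
        return 'F'
    elif df_value in ["RCM", "LCM", "LDM", "CAM", "CDM", "RM", "LAM", "LM", "RDM", "CM", "RAM"]:
        return 'M'
    elif df_value in ["RCB", "CB", "LCB", "LB", "RB", "RWB", "LWB"]:
        return 'D'
    return df_value  # 'GK' passes through unchanged

assert positionConverter('ST') == 'F'
assert positionConverter('CAM') == 'M'
assert positionConverter('CB') == 'D'
assert positionConverter('GK') == 'GK'
```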
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
pd.options.display.float_format = '{:.2f}'.format
What are the major nationalities in FIFA 19?¶
df_countries = df.groupby(by='nationality').size().reset_index()
df_countries.columns = ['country','count']
fig = go.Figure(data=go.Choropleth(
locationmode = 'country names',
locations = df_countries['country'],
z = df_countries['count'],
text = df_countries['country'],
colorscale = 'Cividis',
autocolorscale=False,
marker_line_color='darkgray',
marker_line_width=0.5,
colorbar_title = 'Number of Players',
))
fig.update_layout(
title_text='Player count Distribution per Country',
geo=dict(
showframe=False,
showcoastlines=False,
projection_type='equirectangular'
)
)
fig.show(renderer='notebook')
Most players are from European and South American countries
Top 5 countries
- England - 1657
- Germany - 1195
- Spain - 1071
- Argentina - 936
- France - 911
df_club = df[['club','mkt_value']].groupby(['club'],as_index=False).sum()
df_club = df_club.sort_values('mkt_value',ascending=False).head(10)
fig = go.Figure(go.Bar(
y=df_club['club'],
x=df_club['mkt_value'],
orientation='h'))
fig.update_layout(title_text='Total Value by Club')
fig.show(renderer='notebook')
No surprise that Real Madrid and FC Barcelona have the top spots as they have the likes of Messi and Ronaldo - the highest valued players in the game
fig = go.Figure(data=go.Scatter(
y = df['mkt_value'],
x=df['age'],
mode='markers',
marker = dict(color=df['overall'],colorbar=dict(title='Overall'))
))
fig.update_layout(title_text = "Value vs Age vs Overall")
fig.update_xaxes(title_text='Age(Years)')
fig.update_yaxes(title_text='Value (Euros)')
fig.show(renderer='notebook')
Correlation Analysis¶
cat_col = df.select_dtypes(include=object).columns
num_col = df.select_dtypes(exclude=object).columns
print('Numerical features: ',num_col.values)
Here is a list of all numerical features from our table
'age', 'overall', 'potential', 'mkt_value', 'wage', 'international_rep', 'weak_foot', 'skill_moves', 'crossing', 'finishing', 'headingaccuracy', 'shortpassing', 'volleys', 'dribbling', 'curve', 'fkaccuracy', 'longpassing', 'ballcontrol', 'acceleration', 'sprintspeed', 'agility', 'reactions', 'balance', 'shotpower', 'jumping', 'stamina', 'strength', 'longshots', 'aggression', 'interceptions', 'positioning', 'vision', 'penalties', 'composure', 'marking', 'standingtackle', 'slidingtackle', 'gkdiving', 'gkhandling', 'gkkicking', 'gkpositioning', 'gkreflexes', 'on_loan', 'months_at_club', 'height(cm)', 'weight(lbs)', 'att_rate', 'def_rate'
To better determine correlation between player attributes and other features, we'll group the features into attacking, defensive and goalkeeping categories
df_1 = df.copy()
df_1['ovr_pace'] = df[['acceleration','sprintspeed']].mean(axis=1)
df_1['ovr_shooting'] = df[['positioning','finishing','fkaccuracy',\
'shotpower','longshots','volleys','penalties']].mean(axis=1)
df_1['ovr_passing'] = df[['vision','crossing','fkaccuracy',\
'shortpassing','longpassing','curve']].mean(axis=1)
df_1['ovr_dribbling'] = df[['agility','balance','reactions',\
'ballcontrol','dribbling','composure']].mean(axis=1)
df_1['ovr_defending'] = df[['interceptions','headingaccuracy',\
'marking','standingtackle','slidingtackle']].mean(axis=1)
df_1['ovr_physical'] = df[['jumping','stamina','strength','aggression']].mean(axis=1)
df_1['ovr_gk'] = df[['gkdiving','gkhandling','gkkicking','gkpositioning','gkreflexes']].mean(axis=1)
remove = ['acceleration','sprintspeed','positioning','finishing','fkaccuracy','shotpower','longshots','volleys','penalties',
'vision','crossing','fkaccuracy','shortpassing','longpassing','curve', 'agility','balance','reactions','ballcontrol',\
'dribbling','composure','interceptions','headingaccuracy','marking','standingtackle','slidingtackle','jumping',\
'stamina','strength','aggression','gkdiving','gkhandling','gkkicking','gkpositioning','gkreflexes']
df_1.drop(columns=remove,inplace = True)
df_cat = df_1.select_dtypes(include=object).columns
df_num = df_1.select_dtypes(exclude=object).columns
print('Remaining numerical features in table: ', df_num.values)
Remaining numerical features in table:
'age' 'overall' 'potential' 'mkt_value' 'wage' 'international_rep' 'weak_foot' 'skill_moves' 'on_loan' 'months_at_club' 'height(cm)' 'weight(lbs)' 'att_rate' 'def_rate' 'ovr_pace' 'ovr_shooting' 'ovr_passing' 'ovr_dribbling' 'ovr_defending' 'ovr_physical' 'ovr_gk'
corr = df_1.corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool is removed in recent NumPy versions
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(20, 9))
# Colourway
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,
square=True, linewidths=0.1, cbar_kws={"shrink": .5}, annot=True,annot_kws={'size':8},fmt='.2')
fig = go.Figure(data=go.Scatter(
y = df_1['mkt_value'],
x=df_1['overall'],
mode='markers',
marker = dict(color=df_1['wage'],colorbar=dict(title='Wage'))
))
fig.update_layout(title_text = "Overall vs Market Value vs Wage")
fig.update_xaxes(title_text='Overall')
fig.update_yaxes(title_text='Market Value')
fig.show(renderer='notebook')
value_corr = df_1.corr().abs()['mkt_value'].sort_values(ascending=True)
fig = go.Figure(data=go.Scatter(
y = value_corr.values ,
x=value_corr.index,
mode='markers'
))
fig.update_layout(title_text = "Correlation with Label/Dependent Variable")
fig.update_xaxes(title_text='Field')
fig.update_yaxes(title_text='Correlation')
fig.show(renderer='notebook')
df.drop(columns=['international_rep','overall','wage'],inplace=True)
df['mkt_value'].describe()
plt.figure(1, figsize=(18, 7))
ax = sns.countplot( x= 'mkt_value', data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, fontsize=5)
plt.title('Market Value Distribution')
Observations¶
The mkt_value distribution is messy, but we can still detect outliers with extremely high transfer market values.
These players can be grouped as 'Superstar' players, like Messi, Ronaldo, Hazard etc.
The distribution is multimodal.
The highly rated players with abnormally high transfer market values seem to be causing a wide spread.
Approach¶
Remove outliers and recompute the mean. That is, remove the 90th percentile and above and recalculate the mean value.
realistic_value = df['mkt_value'][df['mkt_value'] < df['mkt_value'].quantile(0.9)]
realistic_value.describe()
This looks a bit better. We'll choose the mean as the threshold between low- and high-value players.
threshold = realistic_value.mean()
Now we'll create a modified dataframe with the required information
df.head()
df1 = df.copy()
## setting binary values for low value and high value
df1['high_value'] = (df1['mkt_value'] > threshold).astype(int)
df1.drop(columns=['mkt_value','name','nationality','club'],inplace=True)
df1.head()
#splitting player workrates into two columns
#from documentation, first value is attacking work rate and second is defensive work rate
work_rates = df1['work_rate'].str.split('/', n=1, expand=True)
df1['work_rate_a'] = work_rates[0]
df1['work_rate_d'] = work_rates[1]
df1.drop(columns=['work_rate'],inplace=True)
cat_vars = ['preferred_foot','body_type','position','work_rate_a','work_rate_d']
for var in cat_vars:
    dummies = pd.get_dummies(df1[var], prefix=var)
    df1 = df1.join(dummies)
cat_vars = ['preferred_foot','body_type','position','work_rate_a','work_rate_d']
data_vars=df1.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]
These are the remaining columns in the data set
df_final=df1[to_keep]
df_final.columns.values
Over-sampling using SMOTE¶
The threshold for high market value players was not set at the 50th percentile of the value data; it was deliberately lowered to better split the players. This caused high market value players to be under-represented, as seen in the bar chart below.
We'll up-sample the high-value players using the SMOTE algorithm (Synthetic Minority Oversampling Technique).¶
plt.figure(1, figsize=(18, 7))
ax = sns.countplot( x= 'high_value', data=df1)
plt.title('Count of High Market Value Players vs Low Market Value Players')
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X = df_final.loc[:, df_final.columns != 'high_value']
y = df_final.loc[:, df_final.columns == 'high_value']
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
os_data_X, os_data_y = os.fit_resample(X_train, y_train)  # fit_sample in older imblearn versions
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['high_value'])
# check the class balance of the oversampled data
print("Length of oversampled data:", len(os_data_X))
print("Number of low-value players in oversampled data:", len(os_data_y[os_data_y['high_value']==0]))
print("Number of high-value players:", len(os_data_y[os_data_y['high_value']==1]))
print("Proportion of low-value players in oversampled data:", len(os_data_y[os_data_y['high_value']==0])/len(os_data_X))
print("Proportion of high-value players in oversampled data:", len(os_data_y[os_data_y['high_value']==1])/len(os_data_X))
Length of oversampled data: 15888
Number of low-value players in oversampled data: 7944
Number of high-value players: 7944
Proportion of low-value players in oversampled data: 0.5
Proportion of high-value players in oversampled data: 0.5
We now have a perfectly balanced dataset!
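Conceptually, SMOTE builds each synthetic sample by picking a minority-class point, choosing one of its nearest minority neighbors, and interpolating a new point on the line segment between them. A minimal numpy sketch of that interpolation step (a conceptual illustration, not imblearn's actual implementation; the points are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)

# two minority-class samples that are nearest neighbors (toy values)
x_i = np.array([1.0, 2.0])
x_nn = np.array([3.0, 4.0])

# SMOTE draws a random gap in [0, 1) and interpolates between the two points
gap = rng.random()
x_synthetic = x_i + gap * (x_nn - x_i)

# the synthetic sample lies on the segment between the two originals
assert np.all(x_synthetic >= np.minimum(x_i, x_nn))
assert np.all(x_synthetic <= np.maximum(x_i, x_nn))
```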
Recursive Feature Elimination¶
Recursive Feature Elimination (RFE) is based on the idea to repeatedly construct a model and choose either the best or worst performing feature, setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
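As a minimal illustration on synthetic data (not our FIFA features), scikit-learn's RFE fits the estimator repeatedly, dropping the weakest feature each round until the requested number remains:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# toy dataset: 8 features, of which only 3 are informative
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, n_redundant=0,
                                   random_state=42)

rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe_demo.fit(X_toy, y_toy)

# support_ flags exactly the 3 surviving features; their ranking_ is 1
assert rfe_demo.support_.sum() == 3
assert (rfe_demo.ranking_ == 1).sum() == 3
```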
df_final_vars=df_final.columns.values.tolist()
y=['high_value']
X=[i for i in df_final_vars if i not in y]
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=10)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
# columns selected by RFE
cols_list = df_final.columns.values
cols_list = np.delete(cols_list, np.where(cols_list == 'high_value'))
cols = [col for col, selected in zip(cols_list, rfe.support_) if selected]
cols
X=os_data_X[cols]
y=os_data_y['high_value']
The RFE has helped us select the following features:
'skill_moves', 'preferred_foot_Left', 'preferred_foot_Right', 'position_D', 'position_F', 'position_GK', 'position_M', 'work_rate_a_Medium', 'work_rate_d_ Low', 'work_rate_d_ Medium'
import statsmodels.api as sm
df_final_vars=df_final.columns.values.tolist()
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
The p-values for most of the variables are smaller than 0.05, except for one variable, which we will therefore remove.
cols2 = ['work_rate_a_High']
cols = [x for x in cols if x not in cols2]
X = os_data_X[cols]
y=os_data_y['high_value']
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Accuracy of logistic regression classifier on test set: 0.70
Confusion Matrix¶
from sklearn.metrics import confusion_matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
ax = plt.subplot()
sns.heatmap(confusion_matrix,annot=True,ax=ax,cmap="Blues",fmt='g',xticklabels=['Positive','Negative'], yticklabels=['Positive','Negative'])
The result is telling us that we have [1624 + 1727] correct predictions and [796 + 607] incorrect predictions.
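The standard metrics can be recomputed by hand from those counts (which off-diagonal count is false positives versus false negatives is an assumption based on the heatmap layout):

```python
# counts reported above; the FP/FN assignment is an assumption
tn, fp, fn, tp = 1624, 796, 607, 1727

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

assert round(accuracy, 2) == 0.70  # matches the reported test-set accuracy
```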
Let's evaluate the model using model evaluation metrics such as accuracy and precision.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Interpretation:¶
When our logistic regression model predicted players as having a high market value, those players had a high value 70% of the time.
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--') # ROC curve of a purely random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
The AUC score in this case is 0.71. An AUC score of 1 represents a perfect classifier, while 0.5 represents a worthless one.
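To see what the AUC measures, here it is on a four-point toy example: the score equals the fraction of (negative, positive) pairs the classifier's scores rank correctly.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (negative, positive) pairs are ranked correctly -> AUC = 0.75
assert abs(roc_auc_score(y_true, y_score) - 0.75) < 1e-9
```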
Improvements¶
- Investigate multicollinearity in the model; ensure that the independent variables are independent of each other
- Gather more training data
- Dimensionality reduction
- More model testing, repeating the train/test loop
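For the first improvement, one common check is the variance inflation factor (VIF): a VIF well above roughly 5-10 flags a feature as largely explainable by the others. A sketch on toy data using statsmodels (the threshold values here are conventional rules of thumb, not derived from this project):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 * 0.95 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                         # independent feature
X_vif = np.column_stack([x1, x2, x3])

vifs = [variance_inflation_factor(X_vif, i) for i in range(X_vif.shape[1])]

# the collinear pair shows inflated VIFs; the independent feature stays near 1
assert vifs[0] > 5 and vifs[1] > 5
assert vifs[2] < 2
```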